refactor(docker): split gateway/supervisor Dockerfiles and use native rust builds #1316
TaylorMutch wants to merge 3 commits into
… images

Drop the in-Docker BUILD_FROM_SOURCE path so both images consume only prebuilt binaries staged natively via tasks/scripts/stage-prebuilt-binaries.sh. This mirrors what CI does and reuses the host's cargo target cache and sccache across rebuilds.

- Dockerfile.gateway: nvcr.io/nvidia/distroless/cc:v4.0.4 base (the 4.0.0 tag does not exist on nvcr.io; the registry uses a v prefix). GNU-linked binary copied to /usr/local/bin.
- Dockerfile.supervisor: scratch base, static musl binary. Static linkage lets the image stay scratch while still being executable as a Kubernetes init container.
- skaffold.yaml: each artifact invokes tasks/scripts/docker-build-image.sh, which stages the binary natively (cargo / cargo-zigbuild) and then builds the image. Drops the cross-build.sh dependency from the supervisor build.
- seccomp.rs: add a local SYS_kexec_file_load constant for musl/aarch64. libc 0.2.185 omits the symbol from its musl/aarch64 bindings, so the supervisor's seccomp filter previously failed to compile for that target.
- architecture/build.md: describe the native-first pipeline and per-image runtime choices.

Local validation: gateway image 101MB (was 194MB), supervisor image 21.7MB. helm:skaffold:run deploys cleanly; the static musl supervisor binary runs correctly in a non-glibc agent container.
/ok to test f763611

Fixes formatting drift flagged by `cargo fmt --all -- --check`.

/ok to test 59a089afd5f44a4c46cf5a4eb3712f99066014fc

@TaylorMutch, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

/ok to test 59a089a
    // libc 0.2.185 omits `SYS_kexec_file_load` from the musl/aarch64 bindings even
    // though the kernel exposes syscall 294. Fall back to the literal so the
    // supervisor's seccomp filter still blocks fileless kernel-image loads when
    // built statically against musl on aarch64.
    #[cfg(all(target_arch = "aarch64", target_env = "musl"))]
    #[allow(non_upper_case_globals)]
    const SYS_kexec_file_load: libc::c_long = 294;
    #[cfg(not(all(target_arch = "aarch64", target_env = "musl")))]
    use libc::SYS_kexec_file_load;
This failed to build locally on macOS, so the constant was added to get it working with a static build.
    libc::SYS_finit_module,
    libc::SYS_delete_module,
    libc::SYS_kexec_load,
    libc::SYS_kexec_file_load,
Should this be dependent on the constant defined or imported above? https://github.com/NVIDIA/OpenShell/pull/1316/changes#r3223215941
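
A minimal sketch of what this comment seems to be asking for: the blocked-syscall list refers to the bare `SYS_kexec_file_load` name, so it resolves to whichever definition the `cfg` attributes selected, rather than to a `libc::`-qualified path that does not exist on musl/aarch64. To stay dependency-free, the sketch hard-codes syscall numbers per architecture instead of importing `libc`. Only the aarch64 value 294 comes from the PR; the other numbers and the `blocked_syscalls` helper are assumptions for illustration, not the code in `seccomp.rs`.

```rust
// Sketch only: numbers are hard-coded so this compiles without the `libc` crate.
// In the PR, every target except musl/aarch64 takes the value from `libc`.

// `kexec_file_load`: 294 is the aarch64 value quoted in the PR;
// 320 (x86_64) is an assumption added for this sketch.
#[cfg(target_arch = "aarch64")]
#[allow(non_upper_case_globals)]
const SYS_kexec_file_load: i64 = 294;
#[cfg(target_arch = "x86_64")]
#[allow(non_upper_case_globals)]
const SYS_kexec_file_load: i64 = 320;

// `kexec_load`, hard-coded the same way (assumed values, not from the PR).
#[cfg(target_arch = "aarch64")]
#[allow(non_upper_case_globals)]
const SYS_kexec_load: i64 = 104;
#[cfg(target_arch = "x86_64")]
#[allow(non_upper_case_globals)]
const SYS_kexec_load: i64 = 246;

/// Hypothetical blocked-syscall list. Each entry uses the bare constant name,
/// so the list compiles on any target where one of the definitions above applies.
fn blocked_syscalls() -> Vec<i64> {
    vec![SYS_kexec_load, SYS_kexec_file_load]
}
```

With this shape, adding another target-specific fallback only touches the constant definitions; the list itself stays unchanged.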
    # the dynamic loader needed by the GNU-linked gateway binary while keeping the
    # attack surface small.

    ARG GATEWAY_BASE_IMAGE=nvcr.io/nvidia/distroless/cc:v4.0.4
We would need to update
for this change. The image choice in the job is pretty arbitrary and should work as long as nvidia-smi can run there. I wanted to avoid using ubuntu directly, but that would also be possible.
        DOCKER_TARGET="supervisor"
        DOCKERFILE="deploy/docker/Dockerfile.supervisor"
        ;;
      supervisor-output)
Out of scope for this PR, but would matching on `supervisor|supervisor-output` make sense here?
Summary
Splits the combined `Dockerfile.images` into per-image Dockerfiles (gateway and supervisor) and switches local image builds to use native Rust binaries instead of compiling inside Docker. This matches the CI pipeline and makes local image rebuilds much faster because the host's cargo target cache and sccache are reused across iterations.

Related Issue
N/A — incremental cleanup of the Docker build pipeline.
Changes
- `deploy/docker/Dockerfile.gateway` (new): `nvcr.io/nvidia/distroless/cc:v4.0.4` base, GNU-linked binary at `/usr/local/bin/openshell-gateway`. Image dropped from 194MB to 101MB.
- `deploy/docker/Dockerfile.supervisor` (new): `scratch` base, static musl binary at `/openshell-sandbox`. Static linkage lets the image stay `scratch` while remaining executable in the Kubernetes init-container copy-self path. Image is 21.7MB.
- `deploy/docker/Dockerfile.images` (deleted): replaced by the split files.
- `deploy/helm/openshell/skaffold.yaml`: each artifact now invokes `tasks/scripts/docker-build-image.sh`, which stages binaries natively (`cargo` / `cargo-zigbuild`) before running the Docker build. Drops `BUILD_FROM_SOURCE=1` and the `cross-build.sh` dependency from the supervisor build.
- `tasks/scripts/stage-prebuilt-binaries.sh`: supervisor now builds against musl (`x86_64-unknown-linux-musl` / `aarch64-unknown-linux-musl`); gateway stays on GNU.
- `tasks/scripts/docker-build-image.sh`: routes `gateway` / `supervisor` targets to the new per-image Dockerfiles.
- `.github/workflows/rust-native-build.yml`: supervisor binary build uses `zig cc` musl wrappers so the static binary can be cross-compiled in CI.
- `crates/openshell-sandbox/src/sandbox/linux/seccomp.rs`: adds a local `SYS_kexec_file_load` constant for `musl/aarch64`. `libc 0.2.185` omits this symbol from its musl/aarch64 bindings, preventing the supervisor's seccomp filter from compiling for that target. The constant matches the kernel's syscall number (294) so the filter still blocks fileless kernel-image loads.
- `architecture/build.md`: rewrites the Container Builds section around the native-first pipeline and the per-image runtime choices.
- `crates/openshell-driver-podman/README.md` and `.agents/skills/debug-openshell-cluster/SKILL.md`: updated to point at the new Dockerfile paths.

Testing
- `tasks/scripts/stage-prebuilt-binaries.sh all` (gateway GNU, supervisor static musl).
- `tasks/scripts/docker-build-image.sh gateway` and `… supervisor` produce running images; `--version` smoke test passes for both.
- `mise run helm:k3s:create` + `mise run helm:skaffold:run` deploys cleanly into a local k3d cluster; gateway pod goes `1/1 Running` with healthy startup logs.
- … `emptyDir`, and the agent container runs it inside a non-glibc image (`busybox`) successfully.
- `mise run pre-commit` clean for files changed in this PR. (Unrelated `cargo fmt` drift in `driver.rs` will be handled in a separate PR.)

Checklist